Abstract

In many classification problems a classifier should be robust to small variations in
the input vector. This is a desired property not only for particular transformations,
such as translation and rotation in image classification problems, but also for all
others for which the change is small enough to keep the object perceptually
indistinguishable. We propose two extensions of the backpropagation algorithm that
train a neural network to be robust to variations in the feature vector. While the
first of them enforces robustness of the loss function to all variations, the second
method trains the predictions to be robust to the particular variation that changes
the loss function the most. The second method demonstrates better results, but
is slightly slower. We analytically compare the proposed algorithms with the two
most similar approaches (Tangent BP and Adversarial Training), and propose fast
versions of them. In the experimental part we compare all algorithms in
terms of classification accuracy and robustness to noise on the MNIST and CIFAR-10
datasets. Additionally, we analyze how the performance of the proposed algorithm
depends on the dataset size and data augmentation.


1 Introduction 

Neural networks are widely used in machine learning. For example, they currently show the
best results in image classification (Szegedy et al. (2013); Lee et al. (2014)), image labeling
(Karpathy & Fei-Fei (2014)) and speech recognition. Deep neural networks applied to large datasets
can automatically learn from a huge number of features, which allows them to represent very complex
relations between raw input data and output classes. However, it also means that deep neural
networks can suffer from overfitting, and regularization techniques are crucially important for
good performance.

It is often the case that there exist a number of variations of a given object that preserve its label. 
For example, image labels are usually invariant to small variations in their location on the image, 
size, angle, brightness, etc. In the area of voice recognition the result has to be invariant to the 
speech tone, speed and accent. Moreover, the predictions should always be robust to random noise. 
However, this knowledge is not incorporated in the learning process. 

In this work we propose two methods of achieving local invariance by extending the standard
backpropagation algorithm. The first of them enforces robustness of the loss function to all variations in the
input vector. The second method trains the predictions to be robust to variation of the input vector in
the direction that changes the loss function the most. We refer to them as Loss Invariant
BackPropagation (Loss IBP) and Prediction IBP. While the former is faster, the latter demonstrates
better performance. Both methods can be applied to all types of neural networks in combination
with any other regularization technique.

http://www.demyanov.net

1.1 Backpropagation algorithm

We denote $K$ as the number of layers in a neural network and $y_i$, $i \in \{0, \ldots, K\}$, as the activation
vectors of each layer. The activation of the first layer $y_0$ is the input vector $x$. If the input is an image
that consists of one or more feature maps, we still consider it as a vector by traversing the maps and
concatenating them together. The transformation between layers might be different: convolution,
matrix multiplication, non-linear transformation, etc. We assume that $y_i = f_i(y_{i-1}; w_i)$, where $w_i$
is the set of weights, which may be empty. The computation of the layer activations is the first
(forward) pass of the backpropagation algorithm. Moreover, the loss function $L(y_K)$ can also be
considered as a layer $y_{K+1}$ of length 1. The forward pass is thus a calculation of the composition
of functions $f_{K+1}(f_K(\ldots f_1(x) \ldots))$, applied to the input vector $x$.

Let us denote the vectors of derivatives with respect to layer values $\partial L / \partial y_i$ as $dy_i$. Then, similar to
the forward propagating functions $y_i = f_i(y_{i-1}; w_i)$, we can define backward propagating functions
$dy_{i-1} = \tilde f_i(dy_i; w_i)$. We refer to them as reverse functions. According to the chain rule, we can
obtain their matrix form:

$dy_{i-1} = \tilde f_i(dy_i; w_i) = dy_i \cdot J_i(y_{i-1}; w_i)$, (1)

where $J_i(y_{i-1}; w_i)$ is the Jacobian, i.e. the matrix of the derivatives $\partial y_i^j / \partial y_{i-1}^k$. The backward pass
is thus a consecutive matrix multiplication of the Jacobians $\prod_{i=K+1}^{1} J_i(y_{i-1})$ of the layer functions
$f_i(y_{i-1}; w_i)$, computed at the points $y_{i-1}$. Note that the first Jacobian $J_{K+1}(y_K)$ is the vector of
derivatives $dy_K = \partial L / \partial y_K$ of the loss function $L$ with respect to the predictions $y_K$. The last vector
$dy_0 = \prod_{i=K+1}^{1} J_i(y_{i-1}) = \nabla_x L$ contains the derivatives of the loss function with respect to the
input vector.

Next, let us also denote the vector of weight gradients $\partial L / \partial w_i$ as $dw_i$. Then we can write the chain
rule for $dw_i$ in a matrix form as $dw_i = J_i^w(y_{i-1}; w_i) \cdot dy_i$, where $J_i^w(y_{i-1}; w_i)$ is the Jacobian
matrix of the derivatives with respect to the weights $\partial y_i^j / \partial w_i^k$. However, if $f_i$ is a linear function, the
Jacobian $J_i^w(y_{i-1}; w_i)$ is equivalent to the vector $y_{i-1}^T$, so

$dw_i = y_{i-1}^T \cdot dy_i$ (2)

In this article we consider all layers with weights to be linear. 

After the $dw_i$ are computed, the weights are updated: $w_i \leftarrow w_i - \alpha \cdot dw_i$, $\forall i \in \{1, \ldots, K\}$, $\alpha > 0$.
Here $\alpha$ is the coefficient that specifies the size of the step in the direction opposite to the derivative,
which is usually reduced over time.
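
The two passes and the weight update described above can be sketched for a chain of linear layers as follows. This is only an illustration under our own assumptions: the layers are $y_i = y_{i-1} \cdot w_i$ with row vectors, the loss is a squared error (chosen here for simplicity, not the loss used in the experiments), and all function names are hypothetical.

```python
# Minimal sketch: forward pass, backward pass (Eqs. 1 and 2) and one update step
# for linear layers y_i = y_{i-1} * w_i. The squared-error loss is an assumption.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def forward(x, weights):
    """Forward pass: return the activations y_0, ..., y_K as row vectors."""
    ys = [x]
    for w in weights:
        ys.append(matmul(ys[-1], w))
    return ys

def loss(y_last, target):
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_last[0], target[0]))

def backward(ys, weights, target):
    """Backward pass: return dy_0 and the weight gradients dw_1, ..., dw_K."""
    dy = [[p - t for p, t in zip(ys[-1][0], target[0])]]  # dy_K for squared error
    dws = []
    for i in range(len(weights) - 1, -1, -1):
        dws.append(matmul(transpose(ys[i]), dy))   # Eq. (2): dw_i = y_{i-1}^T * dy_i
        dy = matmul(dy, transpose(weights[i]))     # Eq. (1): dy_{i-1} = dy_i * J_i
    dws.reverse()
    return dy, dws

def sgd_step(weights, dws, alpha):
    """Weight update w_i <- w_i - alpha * dw_i."""
    return [[[w - alpha * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(w_mat, g_mat)]
            for w_mat, g_mat in zip(weights, dws)]

x = [[1.0, -2.0, 0.5]]
target = [[1.0, 0.0]]
weights = [[[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]],
           [[0.5, -0.3], [0.2, 0.1]]]
ys = forward(x, weights)
dy0, dws = backward(ys, weights, target)
new_weights = sgd_step(weights, dws, alpha=0.1)
```

For a linear layer the Jacobian in Eq. (1) is simply the transposed weight matrix, which is why the backward pass reuses `weights[i]`.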

2 Related work 

A number of techniques that achieve robustness to particular variations have been proposed.
Convolutional neural networks, which consist of pairs of convolutional and subsampling layers,
are the most commonly used ones. They provide robustness to small shifts and scaling, and also
significantly reduce the number of trainable parameters compared to fully-connected perceptrons.
However, they are not able to deal with other types of variations. Another popular method is data
augmentation. It consists of training on objects artificially generated from the existing training set
using transformation functions. Unfortunately, such generation is not always possible. There
exist two other approaches, which attempt to solve this problem analytically using the gradients
of the loss function with respect to the input. We discuss them below.

2.1 Tangent backpropagation algorithm 

The first approach is the Tangent backpropagation algorithm (Simard et al. (2012)), which allows
training a network that is robust to a set of predefined transformations. The authors consider some invariant
transformation function $g(x; \theta)$, s.t. $g(x; 0) = x$, which must preserve the predictions $p(g(x; \theta))$
within a local neighborhood of $\theta = 0$. Since the predictions $p(x)$ in this neighborhood must be
constant, a necessary condition for the network is

$\nabla_\theta p(g(x; \theta))|_{\theta=0} = 0$

To achieve this, the authors add a loss regularization term $R(x)$ to the main loss function $L$:

$L_{min}(x) = L(p(x)) + \beta R(x) = L(p(x)) + \beta \tilde L(\nabla_\theta p(g(x; \theta))|_{\theta=0}), \quad \tilde L(z) = \frac{1}{r}\|z\|_r^r$ (3)

Using the chain rule we can obtain the following representation for $\nabla_\theta p(g(x; \theta))|_{\theta=0}$:

$\nabla_\theta p(g(x; \theta))|_{\theta=0} = \nabla_x p(x) \cdot \nabla_\theta g(x; \theta)|_{\theta=0} = \prod_{i=K}^{1} J_i(y_{i-1}) \cdot \nabla_\theta g(x; \theta)|_{\theta=0}$

The last term depends only on the function $g(x; \theta)$ and the input value $x$, and therefore can be
computed in advance. The authors refer to $\nabla_\theta x = \nabla_\theta g(x; \theta)|_{\theta=0}$ as tangent vectors. The authors
propose to compute the additional loss term by initializing the network with a tangent vector $\nabla_\theta x^T$
and propagating it through a linearized network, i.e., consecutively multiplying it by the transposed
Jacobians $J_i^T(y_{i-1})$, $i \in \{1, \ldots, K\}$. Indeed,

$\nabla_\theta x^T \cdot \prod_{i=1}^{K} J_i^T(y_{i-1}) = \left(\prod_{i=K}^{1} J_i(y_{i-1}) \cdot \nabla_\theta x\right)^T = \left(\nabla_x p(x) \cdot \nabla_\theta x\right)^T = \left(\nabla_\theta p(g(x; \theta))|_{\theta=0}\right)^T$
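
As an illustration of propagating a tangent vector through a linearized network, the following sketch uses a small two-layer network and a simple intensity scaling as the transformation. Both are our own assumptions for the example (they are not the transformations used by the authors of Tangent BP), and the function names are hypothetical.

```python
import math

def matvec(v, w):
    """Row vector times matrix (list of rows)."""
    return [sum(v[k] * w[k][j] for k in range(len(v))) for j in range(len(w[0]))]

def predict(x, w1, w2):
    """Two-layer network p(x) = tanh(x W1) W2; also returns the hidden activations."""
    hidden = [math.tanh(a) for a in matvec(x, w1)]
    return matvec(hidden, w2), hidden

def propagate_tangent(tangent, hidden, w1, w2):
    """Push a tangent vector through the linearized network, Jacobian by Jacobian."""
    t = matvec(tangent, w1)                                # linear layer: same weights
    t = [ti * (1.0 - h * h) for ti, h in zip(t, hidden)]   # tanh layer: diagonal Jacobian
    return matvec(t, w2)                                   # linear layer: same weights

def scale(x, theta):
    """Hypothetical transformation g(x; theta) = (1 + theta) * x, so g(x; 0) = x."""
    return [(1.0 + theta) * xi for xi in x]

x = [0.5, -1.0, 2.0]
w1 = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.2]]
w2 = [[1.0, -1.0], [0.5, 0.25]]
p, hidden = predict(x, w1, w2)
tangent = x[:]  # derivative of g(x; theta) w.r.t. theta at theta = 0 for this scaling
grad_theta_p = propagate_tangent(tangent, hidden, w1, w2)
regularizer = 0.5 * sum(g * g for g in grad_theta_p)  # R(x) of Eq. (3) with r = 2
```

One such propagation is needed per tangent vector, which is the source of the linear dependence on the number of transformations.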

The main drawback of Tangent BP is its computational complexity. As can be seen from the definition,
it depends linearly on the number of transformations the classifier learns to be invariant to. The
authors describe an example of training a network for image classification that is robust to five
transformations: two translations, two scalings, and rotation. In this case the required learning time
is 6 times larger than for standard BP.

The usage of tangent vectors also makes Tangent BP more difficult to implement. To compute them,
the authors suggest obtaining a continuous image representation by applying a Gaussian filter, which
requires additional preprocessing and one more hyperparameter (filter smoothness). While the basic
transformation operators are given by simple Lie operators, other transformations may require
additional coding.

2.2 Adversarial Training 

The second algorithm is the recently proposed Adversarial Training (Goodfellow et al. (2014)). In
(Szegedy et al. (2013)) the authors described an interesting phenomenon: it is possible to artificially
generate an image indistinguishable from an image of the dataset, such that a trained network's
prediction about it is completely wrong. Of course, people never make such mistakes.
These objects were called adversarial examples. In (Goodfellow et al. (2014)) the authors showed
that it is possible to generate adversarial examples by moving in the direction given by the loss
function gradient $\nabla_x L(p(x))$, i.e.,

$x^*(x; \epsilon) = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(p(x)))$ (4)

In a high dimensional space even a small move may significantly change the loss function $L(p(x))$.
To deal with the problem of adversarial examples, the authors propose the Adversarial
Training (AT) algorithm. The idea of the algorithm is to additionally train the network on the adversarial
examples, which can be quickly generated using the gradients $\nabla_x L(p(x))$ obtained at the end of
the backward pass. Adversarial Training uses the same labels $l(x)$ for the new object $x^*$ as for the
original object $x$, so the loss function $L(p(x^*(x; \epsilon)))$ is the same. The updated loss function is thus

$L_{min}(x) = (L(p(x)) + L(p(x^*(x; \epsilon))))/2$ (5)
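
Eqs. (4) and (5) can be sketched on a toy model. Here we assume a logistic-regression classifier with a negative log-likelihood loss, because its input gradient has a closed form; the model, the numbers and the function names are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(x, label, w, b):
    """Negative log-likelihood of a logistic-regression classifier."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def input_gradient(x, label, w, b):
    """Gradient of the loss w.r.t. the input: (p - label) * w for this model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - label) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def adversarial_example(x, grad, eps):
    """Eq. (4): x* = x + eps * sign(grad_x L)."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

def adversarial_training_loss(x, label, w, b, eps):
    """Eq. (5): average of the losses on x and on x*, both with the label of x."""
    x_adv = adversarial_example(x, input_gradient(x, label, w, b), eps)
    return 0.5 * (nll(x, label, w, b) + nll(x_adv, label, w, b))

w, b = [1.5, -2.0, 0.5], 0.1
x, label = [0.2, -0.1, 0.4], 1
x_adv = adversarial_example(x, input_gradient(x, label, w, b), eps=0.1)
```

Each coordinate of `x_adv` differs from `x` by at most `eps`, yet the loss on `x_adv` is higher than on `x`.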

Adversarial Training is quite similar to the Tangent BP algorithm, but differs in a couple of
aspects. First, Adversarial Training uses the gradients of the loss function $\nabla_x L(p(x))$, while Tangent
BP uses tangent vectors $\nabla_\theta x^T$. Second, while Adversarial Training propagates the new objects
through the original network, Tangent BP propagates the gradients $\nabla_\theta x^T$ through the linearized
network. The proposed Prediction IBP algorithm can also be derived by combining these properties.

3 Invariant backpropagation 

In the first part of this section we describe Loss IBP, which makes the main loss function robust to
all variations in the input vector. In the second part we describe Prediction IBP, which aims to make
the network predictions robust to the variation in the direction specified by $\nabla_x L(p(x))$. While both
versions use the gradients $\nabla_x L(p(x))$, they differ in their loss functions, computational complexity,
and also in experimental results.

3.1 Loss IBP

In many classification problems we have a large number of features. Formally it means that the
input vectors $y_0$ come from a high dimensional vector space. In this space every vector can move in
a huge number of directions, but most of them should not change the vector's label. The goal of the
algorithm is to make a classifier robust to such variations.

Let us consider a $K$-layer neural network with an input $x = y_0$ and predictions $p(x) = y_K$.
Using the vector of true labels $l(x)$, we compute the loss function $L(p(x)) = y_{K+1}$, and at the
end of the backward pass of the backpropagation algorithm we obtain the vector of its gradients $dy_0 =
\nabla_x L(p(x)) = \prod_{i=K+1}^{1} J_i(y_{i-1})$. This vector defines the direction that changes the loss function
$L(p(x))$, and its length specifies how large this change is. In a small neighborhood we can assume
that $dL \approx dx \cdot \nabla_x L^T(p(x))$. If $\nabla_x L(p(x))$ is small, then the same change of $x$ will cause a smaller
change of $L$. Thus, a smaller vector length corresponds to a more robust classifier, and vice
versa. Let us specify the additional loss function

$\tilde L(\nabla_x L(p(x))) = \tilde L(dy_0) = \frac{1}{r}\|dy_0\|_r^r, \quad \widetilde{dy}_0 = \frac{\partial \tilde L}{\partial (dy_0)}$ (6)

which is computed at the end of the backward pass. In order to achieve robustness to variations, we
need to make it as small as possible. By default we assume $r = 1$.
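
The additional loss of Eq. (6) and its derivative with respect to $dy_0$ are simple to compute once $dy_0$ is available. The sketch below assumes $dy_0$ is given as a plain list of numbers; the function names are hypothetical.

```python
def additional_loss(dy0, r=1):
    """Eq. (6): (1/r) * ||dy0||_r^r for a gradient vector dy0."""
    return sum(abs(g) ** r for g in dy0) / r

def additional_loss_grad(dy0, r=1):
    """Derivative of Eq. (6) w.r.t. dy0: sign(g) * |g|^(r-1) per component."""
    return [((g > 0) - (g < 0)) * abs(g) ** (r - 1) for g in dy0]

dy0 = [0.3, -1.2, 0.0, 2.0]
loss_r1 = additional_loss(dy0, r=1)        # sum of absolute values
grad_r1 = additional_loss_grad(dy0, r=1)   # signs of dy0
grad_r2 = additional_loss_grad(dy0, r=2)   # dy0 itself
```

For $r = 1$ the derivative reduces to the signs of $dy_0$, and for $r = 2$ it coincides with $dy_0$.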

Note that $\tilde L(dy_0)$ is very similar to the Frobenius norm of the Jacobian matrix, which is used as
a regularization term in contractive autoencoders (Rifai et al. (2011)). The minimization of $\tilde L(dy_0)$
encourages the classifier to be invariant to changes of the input vector in all directions, not only
those that are known to be invariant. At the same time, the minimization of $L(p(x))$ ensures that
the predictions change when we move towards the samples of a different class, so the classifier is
not invariant in these directions. The combination of these two loss functions aims to ensure good
performance. In order to minimize the joint loss function

$L_{min}(x) = L(p(x)) + \beta \tilde L(\nabla_x L(p(x)))$, (7)

we need to additionally obtain the derivatives of the additional loss function with respect to the
weights $\widetilde{dw}_i = \nabla_{w_i} \tilde L(dy_0)$. In Section 3.3 we discuss how to efficiently compute them, using only
one additional forward pass. Once these derivatives are computed, we can update the weights using
the new rule

$w_i \leftarrow w_i - \alpha (dw_i + \beta \cdot \widetilde{dw}_i), \quad \alpha > 0, \ \beta > 0$, (8)

Here $\beta$ is the coefficient that controls the strength of regularization, and plays a crucial role in
achieving good performance. Note that when $\beta = 0$, the algorithm is equivalent to the standard
backpropagation. Since the additional loss function aims to minimize the gradients of the main loss
function $\nabla_x L(p(x))$, we call this algorithm Loss IBP.

3.2 Prediction IBP 

While Loss IBP makes the main loss function $L(p(x))$ robust to variations, it does not necessarily
imply the robustness of the predictions $p(x)$ themselves. Unfortunately, we cannot compute the
gradients of the predictions with respect to the input vector, as their dimensionality can be very large.
However, we can compute the gradients of the predictions in the direction given by $\nabla_x L(p(x))$. As
was shown in Section 2.2, movement in this direction can generate adversarial examples, whose
predictions significantly differ from those of $x$. We can thus introduce another additional loss function

$\tilde L(\nabla_\epsilon p(x + \epsilon \nabla_x L(p(x)))|_{\epsilon=0}) = \tilde L(\nabla_x p(x) \cdot \nabla_x L^T(p(x)))$ (9)
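
The equality in Eq. (9) follows from the chain rule. Writing $v = \nabla_x L(p(x))$ for brevity (a shorthand introduced only for this derivation), the intermediate step is:

```latex
\begin{aligned}
\nabla_\epsilon\, p(x + \epsilon v)\big|_{\epsilon=0}
  &= \nabla_x p(x) \cdot \frac{\partial (x + \epsilon v)^T}{\partial \epsilon}
   = \nabla_x p(x) \cdot v^T \\
  &= \prod_{i=K}^{1} J_i(y_{i-1}) \cdot \nabla_x L^T(p(x)).
\end{aligned}
```

The argument of the additional loss is therefore obtained by propagating the vector $\nabla_x L(p(x))$ through the linearized network, in the same way as a tangent vector in Tangent BP.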

We call the algorithm with this loss function Prediction IBP. The only difference between Prediction IBP
and Tangent BP is the initial vector for the third pass. While Tangent BP uses precomputed tangent
vectors, Prediction IBP uses the vector of gradients $\nabla_x L(p(x))$ obtained at the end of the backward
pass. The weight gradients of the additional loss function $\tilde L$ can be computed in the same way as they
are computed in Tangent BP. Therefore, Prediction IBP always requires two times more computation
time than standard BP.

Figure 1: The scheme represents the three passes of the Loss IBP algorithm. Two of them are parts of
standard backpropagation. It also shows which vectors are used for weight derivative computation.


3.3 Loss IBP implementation

In this section we show how to efficiently compute the weight gradients for the additional loss
function (6). To optimize $\tilde L(dy_0)$, we need to look at the backward pass from another point of view.
We may consider the derivatives $dy_K$ as the first layer of a reverse neural network that has $dy_0$
as its output. Indeed, all transformation functions $f_i$ have reverse pairs $\tilde f_i$ that are used to propagate
the derivatives (1). If we consider these pairs as the original transformation functions, they have
their own reverse pairs $\tilde{\tilde f}_i$.

Therefore we consider the derivatives $dy_i$ as activations and the backward pass as a forward pass for
the reverse network. As in standard backpropagation, after such a “forward” pass we compute the
loss function $\tilde L(dy_0)$. The next step is quite natural: we need to initialize the input vector $y_0$ with
the gradients $\widetilde{dy}_0 = \nabla_{dy_0} \tilde L(dy_0)$ and perform another “backward” pass that has the same direction
as the original forward pass. At the same time the derivatives with respect to the weights $\widetilde{dw}_i =
\nabla_{w_i} \tilde L(dy_0)$ must be computed. Fig. 1 shows the general scheme of the derivative computation. The
top part corresponds to the standard backpropagation procedure.

An important subset of transformation functions $f_i(y_{i-1}; w_i)$ is linear functions. It includes
convolutional layers, fully connected layers, subsampling layers, and other types. In Section 7.1 we show
that if a function $f_i$ is linear, i.e. $f_i(y_{i-1}; w_i) = y_{i-1} \cdot w_i$, then

1. $\widetilde{dy}_i = \widetilde{dy}_{i-1} \cdot w_i$,

2. $dw_i = y_{i-1}^T \cdot dy_i$, and $\widetilde{dw}_i = \widetilde{dy}_{i-1}^T \cdot dy_i$

Therefore, in the case of a linear function $f_i$, we can propagate third pass activations the same way
as we do on the first pass, i.e., multiplying them by the same matrix of weights $w_i$. This statement
remains true for element-wise multiplication, as it can be considered as matrix multiplication as well.
The weight derivatives $\widetilde{dw}_i$ are also computed the same way as $dw_i$ in the standard BP algorithm.
This fact allows us to easily implement Loss IBP using the same procedures as for standard BP.

Moreover, in Section 7 we also show that if the function $f_i(y_{i-1}; w_i)$ has a symmetric Jacobian
$J_i(y_{i-1}; w_i)$, then the double reverse function coincides with the reverse one,
$\tilde{\tilde f}_i(\cdot\,; w_i) = \tilde f_i(\cdot\,; w_i)$. This property is useful for the implementation of
non-linear functions. A summary of the Loss IBP algorithm is given in Algorithm 1.
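
The second and third passes for linear layers can be sketched as follows. This is an illustration under explicit assumptions, not the toolbox implementation: all layers are linear, $r = 2$ in Eq. (6) so that the additional loss is smooth, and $dy_K$ is treated as a fixed input of the reverse network. All function names are hypothetical.

```python
# Sketch of the Loss IBP backward and third passes for layers y_i = y_{i-1} * w_i.
# Assumptions: r = 2 in Eq. (6); dy_K is a fixed input of the reverse network.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def backward_pass(dy_last, weights):
    """Second pass: dy_{i-1} = dy_i * w_i^T; returns [dy_0, ..., dy_K]."""
    dys = [dy_last]
    for w in reversed(weights):
        dys.insert(0, matmul(dys[0], transpose(w)))
    return dys

def third_pass(dys, weights):
    """Third pass: propagate the gradient of the additional loss through the same weights."""
    dy_tilde = dys[0]  # r = 2: the gradient of 0.5 * ||dy_0||^2 is dy_0 itself
    dw_tilde = []
    for i, w in enumerate(weights):
        # dw~_i = dy~_{i-1}^T * dy_i
        dw_tilde.append(matmul(transpose(dy_tilde), dys[i + 1]))
        # dy~_i = dy~_{i-1} * w_i, the same operation as in the first pass
        dy_tilde = matmul(dy_tilde, w)
    return dw_tilde

def additional_loss(dy_last, weights):
    """Additional loss 0.5 * ||dy_0||_2^2 as a function of the weights."""
    dy0 = backward_pass(dy_last, weights)[0]
    return 0.5 * sum(g * g for g in dy0[0])

weights = [[[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]],
           [[0.5, -0.3], [0.2, 0.1]]]
dy_last = [[-1.13, 0.21]]  # dy_K, e.g. taken from a standard backward pass
dys = backward_pass(dy_last, weights)
dw_tilde = third_pass(dys, weights)
```

The resulting `dw_tilde` would then be combined with the standard gradients as in Eq. (8). The third pass uses only operations that already exist for the first pass, which is what makes the implementation straightforward for linear layers.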

It is easy to compare the computation time for standard BP and Loss IBP. We know that convolution
and matrix multiplication operations occupy almost all the processing time. As we see, Loss IBP needs
one more forward pass and one more calculation of weight gradients. If we assume that for each
layer the forward pass, backward pass and calculation of derivatives all take approximately the same
time, then IBP requires about 2/3 ≈ 66% more time to train the network. The experiments have
shown that the additional time is about 50%. It is less than the estimated 66% because both
versions contain fixed-time procedures such as batch composing, data augmentation, etc. At the
same time, Loss IBP is faster than Prediction IBP by approximately 20%.

4 Fast versions of Tangent BP and Adversarial Training

4.1 Fast Tangent BP 

Let us change the additional loss function in Eq. (3) such that we penalize the sensitivity of the main
loss function $L(p(x))$ instead of the predictions $p(x)$ themselves:

$R(x) = \tilde L(\nabla_\theta p(g(x; \theta))|_{\theta=0}) = \nabla_\theta p(g(x; \theta))|_{\theta=0} \cdot J_{K+1}^T = \nabla_\theta L(p(g(x; \theta)))|_{\theta=0}$ (10)

In this case the computations can be simplified. Notice that

$\nabla_\theta L(p(g(x; \theta)))|_{\theta=0} = \prod_{i=K+1}^{1} J_i(y_{i-1}) \cdot \nabla_\theta x = \nabla_x L(p(x)) \cdot \nabla_\theta x = dy_0 \cdot \nabla_\theta x$,

Therefore $R(x)$ can be directly computed at the end of the backward pass by multiplying the gradient
$dy_0$ by the tangent vector $\nabla_\theta x$. In the supplementary material (Section 7) we show that this modification of Tangent BP is
equivalent to Loss IBP with the additional loss function $\tilde L(dy_0) = dy_0 \cdot \nabla_\theta x$ instead of $\tilde L(dy_0) =
\frac{1}{r}\|dy_0\|_r^r$. Therefore, this version of Tangent BP can be implemented using ≈ 20% less time than
the original Tangent BP. We refer to it as Fast TBP.
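
The Fast TBP term is a single dot product once $dy_0$ is known. The sketch below checks this on a toy logistic-regression model with an intensity scaling as the transformation; the model, the transformation and the function names are our own assumptions for illustration.

```python
import math

def nll(x, label, w, b):
    """Negative log-likelihood of a logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def input_gradient(x, label, w, b):
    """dy_0 = grad_x L for this model: (p - label) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - label) * wi for wi in w]

def scale(x, theta):
    """Hypothetical transformation g(x; theta) = (1 + theta) * x, so g(x; 0) = x."""
    return [(1.0 + theta) * xi for xi in x]

def fast_tbp_term(dy0, tangent):
    """R(x) of Eq. (10): dot product of dy_0 with a tangent vector."""
    return sum(g * t for g, t in zip(dy0, tangent))

w, b = [1.5, -2.0, 0.5], 0.1
x, label = [0.2, -0.1, 0.4], 1
dy0 = input_gradient(x, label, w, b)
tangent = x[:]  # derivative of g(x; theta) w.r.t. theta at theta = 0
r_x = fast_tbp_term(dy0, tangent)
```

No additional propagation through the network is needed for this term, in contrast to the original Tangent BP.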

4.2 Fast Adversarial Training 

Using the Taylor expansion for the loss of an adversarial example $L(p(x^*(x; \epsilon)))$, we get

$L(p(x^*(x; \epsilon))) = L(p(x + \epsilon \, \mathrm{sign}(\nabla_x L(p(x))))) = L(p(x)) + \epsilon \|\nabla_x L(p(x))\|_1 + o(\epsilon)$ (11)

Combining (5) and (11), we can approximate $L_{min}(x)$ as

$L(p(x)) + \frac{\epsilon}{2}\|\nabla_x L(p(x))\|_1 + o(\epsilon) \approx L(p(x + \frac{\epsilon}{2} \mathrm{sign}(\nabla_x L(p(x))))) = L(p(x^*(x; \frac{\epsilon}{2})))$ (12)

It is easy to notice that the usage of $L_{min}(x)$ instead of $L(p(x^*(x; \epsilon)))$ just scales the hyperparameter
$\epsilon$, which needs to be tuned anyway. At the same time, the calculation of the gradients $\nabla_{w_i} L(p(x))$
takes computation time. Therefore, the Adversarial Training algorithm can be sped up by avoiding
the calculation of $\nabla_{w_i} L(p(x))$, and using only the gradients $\nabla_{w_i} L(p(x^*))$. Compared with the
originally proposed loss $L_{min}(x)$, the optimal parameter $\epsilon$ must be 2 times lower. Similar to Tangent
BP, this trick also saves ≈ 20% of the computation time.
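
The step from (5) and (11) to the left-hand side of (12) is a direct substitution:

```latex
\begin{aligned}
L_{min}(x) &= \tfrac{1}{2}\bigl(L(p(x)) + L(p(x^*(x;\epsilon)))\bigr) \\
 &= \tfrac{1}{2}\bigl(L(p(x)) + L(p(x)) + \epsilon\,\|\nabla_x L(p(x))\|_1 + o(\epsilon)\bigr) \\
 &= L(p(x)) + \tfrac{\epsilon}{2}\,\|\nabla_x L(p(x))\|_1 + o(\epsilon).
\end{aligned}
```

Applying (11) with $\epsilon/2$ in place of $\epsilon$ gives the same first-order expression, which is why the two losses agree up to $o(\epsilon)$.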

Now we can see the difference between Loss IBP and Adversarial Training. Loss IBP minimizes
only the first derivative and does not affect higher-order derivatives of the loss function
$L(p(x))$, such as curvature, whereas Adversarial Training essentially minimizes derivatives of all
orders $\partial^n L(p(x)) / \partial x^n$ with predefined weight coefficients between them. In the case of a highly
nonlinear true data distribution $P(y|x)$ this might be a disadvantage. In Section 5 we show that neither
of these algorithms outperforms the other in all cases.

5 Experiments 

In the experimental part we compared all algorithms and their modifications in different aspects. We
performed the experiments on two benchmark image classification datasets: MNIST (LeCun et al.
(1998)) and CIFAR-10 (Krizhevsky (2009)), using the ConvNet toolbox for Matlab¹. In all
experiments we used the following parameters: 1) batch size 32, 2) initial learning rate $\alpha = 0.1$,
3) momentum 0.9, 4) exponential decrease of the learning rate, i.e., $\alpha_t = \alpha_{t-1} \cdot \gamma$, 5) each
convolutional layer was followed by a scaling layer with a max aggregation function over regions
of size 3x3 with stride 2, 6) ReLU nonlinear functions on the internal layers, 7) a final softmax layer
combined with the negative log-likelihood loss function. We trained the classifiers for 80 epochs
with the coefficient $\gamma = 0.98$, so the final learning rate was $0.1 \cdot 0.98^{80} \approx 0.02$. For the
experiments on MNIST we employed a network with two convolutional layers with 32 filters of size 4x4
(padding 0) and 64 filters of size 5x5 (padding 2) and one internal FC layer of length 256. The
experiments on CIFAR were performed on a network with 3 convolutional layers with filter
size 5x5 (paddings 0, 2 and 2), and one internal FC layer of length 256.
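
The learning rate schedule above can be reproduced in a few lines. The sketch assumes the decay is applied once per epoch, which is consistent with the stated final value; the function name is hypothetical.

```python
def learning_rates(alpha0, gamma, epochs):
    """Learning rate after each epoch under alpha_t = alpha_{t-1} * gamma."""
    rates = []
    alpha = alpha0
    for _ in range(epochs):
        alpha *= gamma
        rates.append(alpha)
    return rates

rates = learning_rates(alpha0=0.1, gamma=0.98, epochs=80)
final_rate = rates[-1]  # approximately 0.02
```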

1 https://github.com/sdemyanov/ConvNet

In all our experiments we used the $L_1$-norm additional loss function, as we found that it always
works better than the $L_2$-norm. For the Tangent BP algorithm we used 5 tangent vectors for each image in
the training set, corresponding to x and y shifts, x and y scaling and rotation. The employed value of the
standard deviation for the Gaussian filter was $\sigma = 0.9$. For numerical stability reasons we omitted
multiplication by softmax gradients on the additional forward and backward passes in the Prediction IBP
and Original TBP algorithms.

5.1 Classification accuracy 

First we compared the performance of all algorithms and their modifications. We trained the
networks on 10 different subsets of the MNIST and CIFAR-10 datasets of size 10000 with different initial
weights and shuffling order. Each dataset was first normalized to have pixel values within [0, 1] and
then the mean pixel value was subtracted from all images. The results are presented in Table 1.


Table 1: Mean errors (%), best parameters, and computation time of one epoch on the MNIST and
CIFAR-10 datasets for standard backpropagation (BP), and two versions each of Invariant
backpropagation (IBP), Adversarial Training (AT) and Tangent backpropagation (TBP).

                             MNIST                                  CIFAR-10
                 Error, %      Best β or ε   Time, s    Error, %     Best β or ε   Time, s
Standard BP      1.21 ± 0.08   N/A           1.51       34.7 ± 0.6   N/A           2.84
Prediction IBP   0.90 ± 0.10   1.0           2.64       32.6 ± 0.4   0.1           5.20
Loss IBP         1.09 ± 0.11   0.03          2.25       33.1 ± 0.5   0.003         4.24
Original AT      0.89 ± 0.05   0.05          2.66       34.7 ± 0.3   0.0003        5.40
Fast AT          0.89 ± 0.06   0.03          2.28       34.7 ± 0.6   0.0003        4.78
Original TBP     1.07 ± 0.12   0.01          7.47       27.2 ± 0.7   0.1           15.55
Fast TBP         1.21 ± 0.08   0.0003        5.38       34.7 ± 0.3   0.0003        10.30


First, we can see that all algorithms except Fast TBP can decrease the classification error compared with
standard BP. We suppose that the lack of improvement by Fast TBP can be explained by a weak
connection between the behavior of the loss function $L(p(x))$ and the predictions $p(x)$ themselves.
While $L(p(x))$ is trained to be robust to predefined transformations, the predictions $p(x)$ might
remain sensitive to them. In what follows we discuss only Original TBP.

Second, we can notice that Original and Fast AT demonstrate identical performance, thus confirming
our suggestion about the possibility of speeding up the algorithm. The achieved speed-up is 17% and 13%.
We can also see that the best values of $\epsilon$ for the MNIST dataset differ by a factor of ≈ 2, which was also
predicted by our considerations. In what follows we do not differentiate between Original and Fast AT, and
refer to them as AT.

Third, we can conclude that Prediction IBP shows better results than Loss IBP (an improvement of
26% vs 10% on MNIST and 6.1% vs 4.6% on CIFAR), while being slightly slower (by 17% and
23% respectively). Since Prediction IBP can be seen as a modification of Original TBP, while Loss
IBP is equivalent to a modification of Fast TBP, the reason might also be a weak connection between
$L(p(x))$ and $p(x)$.
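
The percentages quoted above follow from Table 1. The sketch below recomputes them from the table values; the two helper functions are hypothetical names, and the improvement is measured relative to the standard BP error while the slowdown is the ratio of epoch times.

```python
# Values copied from Table 1 (mean error in %, time of one epoch in seconds).
errors = {
    "MNIST": {"Standard BP": 1.21, "Prediction IBP": 0.90, "Loss IBP": 1.09},
    "CIFAR-10": {"Standard BP": 34.7, "Prediction IBP": 32.6, "Loss IBP": 33.1},
}
times = {
    "MNIST": {"Prediction IBP": 2.64, "Loss IBP": 2.25},
    "CIFAR-10": {"Prediction IBP": 5.20, "Loss IBP": 4.24},
}

def relative_improvement(baseline, value):
    """Error reduction relative to the baseline, in percent."""
    return 100.0 * (baseline - value) / baseline

def slowdown(slow, fast):
    """How much longer the slower method takes, in percent."""
    return 100.0 * (slow / fast - 1.0)

for name in ("MNIST", "CIFAR-10"):
    base = errors[name]["Standard BP"]
    print(name,
          round(relative_improvement(base, errors[name]["Prediction IBP"]), 1),
          round(relative_improvement(base, errors[name]["Loss IBP"]), 1),
          round(slowdown(times[name]["Prediction IBP"], times[name]["Loss IBP"]), 1))
```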

Fourth, we observe that the algorithms demonstrate different performance on the MNIST and CIFAR-10
datasets. The best results on MNIST are achieved by Prediction IBP and AT, while the best result
on CIFAR-10 is achieved by Tangent BP. Notice that the improvement of Tangent BP on the CIFAR-10
dataset (22%) is much larger than the next best result, that of Prediction IBP (6.1%). At the same time,
the AT algorithm could not improve the accuracy at all, achieving its best accuracy with the lowest
possible value of the parameter $\epsilon = 0.0003$. However, the Tangent BP algorithm works much slower
than the competitors.

We suppose that such results can be explained by a high non-linearity of the decision function. As
was shown in Section 4.2, AT minimizes not only the first order of the loss function derivatives,
but also all other orders, thus preventing the classifier from learning such non-linearity. At the same
time, Prediction IBP just makes the predictions less sensitive to variations in the input vector
specified by $\nabla_x L(p(x))$. In the case of a highly non-linear decision function this might not be necessary.
Unlike both AT and IBP, Tangent BP uses prior knowledge to train invariance in directions that the
predictions must always be invariant to. This allows it to achieve the best performance on CIFAR-10.

5.2 Robustness to Adversarial noise



(a) MNIST dataset (b) CIFAR-10 dataset 

Figure 2: Errors of competing algorithms on test sets corrupted by different levels of adversarial noise

We next measured the sensitivity of all algorithms to adversarial noise. We employed the classifiers
trained in Section 5.1 with the parameters that yield the best accuracy, and measured the performance
of the classifiers on the test sets corrupted by adversarial noise. Adversarial examples were generated
using Eq. (4). The results are presented in Fig. 2, where we show the errors for varying $\epsilon$.
It is important to keep in mind that the performance of the classifiers significantly depends on the value
of the regularization parameter.

First, notice that the CIFAR-10 classifiers are much more sensitive to adversarial noise than those
trained on the MNIST dataset. As expected, the most robust classifier was trained by the Adversarial
Training algorithm. It is the only one that consistently remains better than the standard BP classifier. Other
classifiers show better results up to a certain point, after which the level of noise becomes too high.
Interestingly, while Tangent BP demonstrates the best results on the CIFAR-10 dataset, its performance
degrades much faster than the performance of the other classifiers on both MNIST and CIFAR-10. Note
that although the ratio of the best $\beta$ values for Prediction IBP and Loss IBP is the same in both cases, they
demonstrate different behavior.

5.3 Robustness to Gaussian noise 



(a) MNIST dataset (b) CIFAR-10 dataset 


Figure 3: Errors of competing algorithms on test sets corrupted by different levels of Gaussian noise

After that we measured the sensitivity of the same classifiers to Gaussian noise. The results are
presented in Fig. 3. Surprisingly, the most robust classifier on the MNIST dataset was trained by standard
BP. We thus see that robustness to adversarial noise and other predefined transformations makes a
classifier more sensitive to Gaussian noise. At the same time, the Tangent BP classifier remains the most
sensitive to Gaussian noise as well. On the CIFAR-10 dataset it is the only classifier that degrades
significantly faster than the others.

5.4 Dataset size and Data augmentation




(Legend: Standard BP, no augmentation; Prediction IBP, no augmentation; Standard BP, with augmentation; Prediction IBP, with augmentation)

(a) Classification errors (b) Optimal β values

Figure 4: Performance of standard BP and Prediction IBP on subsets of the MNIST dataset of different sizes,
with and without data augmentation. The optimal β values are provided in the right plot.

We have also established how the dataset size and data augmentation affect the Prediction IBP
improvement. We performed these experiments on subsets of the MNIST dataset using the same
parameters as in Section 5. In the data augmentation regime we randomly modified each training
object every time it was accessed according to the following parameters: 1) range of shift from
the central position in each dimension: [−2, 2] pixels, 2) range of scaling in each dimension:
[0.7, 1.4], 3) range of rotation angle: [−18, 18] degrees, 4) pixel value if the pixel is out of the
original image: 0. In order to decrease the variance we trained the networks for 100 epochs without
data augmentation and for 150 epochs with it.
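
Drawing one set of augmentation parameters within the ranges listed above could look as follows. The uniform distribution, the dictionary layout and the function name are our assumptions for illustration; the text above specifies only the ranges.

```python
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the stated ranges.

    Uniform sampling is an assumption; only the ranges come from the text.
    """
    return {
        "shift": (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),  # pixels, per dimension
        "scale": (rng.uniform(0.7, 1.4), rng.uniform(0.7, 1.4)),    # per dimension
        "rotation_deg": rng.uniform(-18.0, 18.0),
        "fill_value": 0.0,  # value for pixels outside the original image
    }

rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(1000)]
```

A new set of parameters would be drawn every time a training object is accessed.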

The results are summarized in Fig. 4. We see that without data augmentation smaller datasets require
more regularization, i.e., larger $\beta$. The relative improvement is also higher: it is 43% for 1k samples
and 18% for 60k. We thus see that the larger the dataset is, the less the network overfits, and the
less improvement we can obtain from regularization. With data augmentation the improvement of
IBP is smaller, but does not converge to 0 even when the full training set is used. Interestingly, the
optimal value of $\beta$ remains approximately on the same level for all dataset sizes. Therefore we
can conclude that data augmentation cannot completely substitute IBP regularization, as the latter
enforces robustness to variations that are not represented by the additionally generated objects.

6 Conclusion 

We proposed two versions of the Invariant Backpropagation algorithm, which extends standard
Backpropagation in order to enforce robustness of a classifier to variations in the input vector. While
Loss IBP trains the main loss function to be insensitive to any variations, Prediction IBP trains the
predictions to be insensitive to variations in the direction of the gradient $\nabla_x L(p(x))$. We have
demonstrated that the weight gradients for Loss IBP can be efficiently computed using only one
additional forward pass, which is identical to the original forward pass for the majority of layer
types. We experimentally established that Prediction IBP achieves higher classification accuracy on
both the MNIST and CIFAR-10 datasets, but requires ≈ 20% more time than Loss IBP. Additionally,
we proposed fast versions of both the Tangent BP and Adversarial Training algorithms. While the fast
version of Tangent BP does not improve classification accuracy, the modification of the Adversarial
Training algorithm demonstrates the same performance as the originally proposed algorithm, while being
≈ 15% faster.

In the experimental part we compared all algorithms and their modifications in terms 
of classification accuracy and robustness to noise. We have found that none of the algorithms 
outperforms the others in all cases. While the best results on MNIST are achieved by Prediction IBP and 
Adversarial Training, Tangent BP significantly outperformed the others on CIFAR-10. At the same time 
the Tangent BP classifier is the most sensitive to Gaussian and adversarial noise on both datasets. 
Additionally, we demonstrated that the regularization effect of Prediction IBP remains visible even on 
the full-size MNIST dataset with data augmentation, so the methods can be applied together. The 
choice of a particular regularizer depends on the properties of a dataset. 
7 Supplementary material 


7.1 Reverse function theorems 


First, let us notice that the forward and backward passes of Loss IBP are performed in the same way 
as in the standard backpropagation algorithm. Then the additional loss function $\tilde{L}$ is computed, 
and its derivatives are used as input for the propagation on the third pass. As follows from the 
definition of $\tilde{L}$, for the norm order $p = 2$ the gradients are 

$$d\tilde{y}_0 = \frac{1}{2}\,\frac{\partial \|dy_0\|_2^2}{\partial (dy_0)} = dy_0,$$

i.e., they coincide with the gradients $dy_0 = \nabla_x L(p(x))$. For $p = 1$, they are the signs of $dy_0$: 

$$d\tilde{y}_0 = \frac{\partial \|dy_0\|_1}{\partial (dy_0)} = \mathrm{sign}(dy_0)$$

In the main text we described double reverse functions $\tilde{f}_i(d\tilde{y}_{i-1}; w_i)$. Let us additionally introduce 
functions $g_i$ and their reverse pairs $\tilde{g}_i$ as 

$$dw_i = g_i(y_{i-1}, dy_i), \quad \text{and} \quad d\tilde{w}_i = \tilde{g}_i(d\tilde{y}_{i-1}, dy_i)$$

Now we can prove the following theorems. 

Theorem 1. Let us assume that $f_i$ is linear, i.e., $f_i(y_{i-1}; w_i) = y_{i-1} \cdot w_i$, where matrix 
multiplication is used. Then 

1. $\tilde{f}_i = f_i$, i.e., $d\tilde{y}_i = \tilde{f}_i(d\tilde{y}_{i-1}; w_i) = d\tilde{y}_{i-1} \cdot w_i$, 

2. $\tilde{g}_i = g_i$, i.e., $dw_i = y_{i-1}^T \cdot dy_i$, and $d\tilde{w}_i = d\tilde{y}_{i-1}^T \cdot dy_i$. 

Proof. First, notice that the reverse of any function is always linear: 

$$dy_{i-1} = \bar{f}_i(dy_i; w_i) = dy_i \cdot J_i(y_{i-1}; w_i) \quad (13)$$

In the case of a linear function $f_i$, the reverse function $\bar{f}_i$ is known: 

$$dy_{i-1} = dy_i \cdot J_i(y_{i-1}; w_i) = dy_i \cdot w_i^T \quad (14)$$

Now let us consider the double reverse functions $\tilde{f}(d\tilde{y}; w)$, such that $d\tilde{y}_i = \tilde{f}_i(d\tilde{y}_{i-1}; w_i)$. 
Compared with the linear $f$, its reverse function $\bar{f}$ multiplies its first argument by the transposed parameter. 
The same is true for the double reverse function $\tilde{f}$ compared with $\bar{f}$, i.e.: 

$$d\tilde{y}_i = \tilde{f}_i(d\tilde{y}_{i-1}; w_i) = d\tilde{y}_{i-1} \cdot (w_i^T)^T = d\tilde{y}_{i-1} \cdot w_i$$

This proves the first statement. 

Next, in the case of a linear function $f_i$ we also know the function $g_i(y_{i-1}, dy_i)$ which computes the 
weight derivatives $dw_i$ (2): 

$$dw_i = g_i(y_{i-1}, dy_i) = y_{i-1}^T \cdot dy_i. \quad (15)$$

Let us again consider the backward pass $\bar{f}_i$ as the forward pass for the reverse net. Since the function 
$\bar{f}_i$ is linear, the formula for derivative calculation of the reverse net is also (15). However, as follows 
from (14), the reverse net uses the transposed matrix of weights for forward propagation, so the result 
of the derivative calculation is also transposed with respect to the matrix $w_i$. Also note that since 
$dy_i$ acts as activations in the reverse net, we pass it as the first argument, and $d\tilde{y}_{i-1}$ as the second. 
Therefore, 

$$d\tilde{w}_i = g_i(dy_i, d\tilde{y}_{i-1})^T = (dy_i^T \cdot d\tilde{y}_{i-1})^T = d\tilde{y}_{i-1}^T \cdot dy_i, \quad (16)$$

and this proves part 2. □ 
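
Theorem 1 can be checked numerically on a small example. The sketch below is not part of the original derivation: it assumes a two-layer purely linear network and a linear main loss $L = y_2 \cdot c^T$, so that the output gradient $dy_2 = c$ is constant, and compares the third-pass formula $d\tilde{w}_i = d\tilde{y}_{i-1}^T \cdot dy_i$ with finite differences of $\tilde{L} = \frac{1}{2}\|dy_0\|_2^2$. All names (`input_gradient`, `third_pass_grads`, `numeric_grad`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 2
w1 = rng.normal(size=(n0, n1))
w2 = rng.normal(size=(n1, n2))
c = rng.normal(size=(1, n2))        # assumed constant output gradient dy_2

def input_gradient(w1, w2):
    """Backward pass of the linear net: dy_{i-1} = dy_i . w_i^T, as in (14)."""
    dy2 = c
    dy1 = dy2 @ w2.T
    dy0 = dy1 @ w1.T
    return dy0, dy1, dy2

def extra_loss(w1, w2):
    """Additional loss for norm order 2: 0.5 * ||dy_0||^2."""
    dy0, _, _ = input_gradient(w1, w2)
    return 0.5 * float(np.sum(dy0 ** 2))

def third_pass_grads(w1, w2):
    """Third pass (Theorem 1, part 1) and weight gradients (part 2)."""
    dy0, dy1, dy2 = input_gradient(w1, w2)
    dty0 = dy0                      # third pass starts from dy_0
    dty1 = dty0 @ w1                # same form as the forward pass, no bias
    return dty0.T @ dy1, dty1.T @ dy2

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        wp, wm = w.copy(), w.copy()
        wp[idx] += eps
        wm[idx] -= eps
        g[idx] = (f(wp) - f(wm)) / (2 * eps)
    return g

dtw1, dtw2 = third_pass_grads(w1, w2)
num1 = numeric_grad(lambda w: extra_loss(w, w2), w1)
num2 = numeric_grad(lambda w: extra_loss(w1, w), w2)
max_error = max(np.abs(dtw1 - num1).max(), np.abs(dtw2 - num2).max())
```

With a non-linear loss the output gradient $dy_K$ itself depends on the weights, so this check covers only the part of the computation that the theorem describes.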

Theorem 2. If the function $f_i(y_{i-1}; w_i)$ has a symmetric Jacobian $J_i(y_{i-1}; w_i)$, then 
$\tilde{f}_i = \bar{f}_i$, i.e., $\tilde{f}_i(d\tilde{y}_{i-1}; w_i) = \bar{f}_i(d\tilde{y}_{i-1}; w_i)$. 

Proof. Indeed, from (13) we know that any reverse function is linear, and its argument is multiplied 
by the Jacobian $J_i(y_{i-1}; w_i)$. From Theorem 1 we also know that the reverse of a linear function 
multiplies its argument by the transposed set of weights, i.e., $d\tilde{y}_i = d\tilde{y}_{i-1} \cdot J_i^T(y_{i-1}; w_i)$. Therefore, 
if the Jacobian is symmetric, then $\tilde{f}_i(d\tilde{y}_{i-1}; w_i) = \bar{f}_i(d\tilde{y}_{i-1}; w_i)$. □ 

7.2 Implementation of particular layer types 

A fully connected layer is a standard linear layer, which transforms its input by multiplication 
by the matrix of weights: $y_i = y_{i-1} \cdot w_i + b_i$, where $b_i$ is the vector of biases. Notice that on the 
backward pass we do not add any bias to propagate the derivatives, so we do not add it on the third 
pass either and do not compute additional bias derivatives. This is the difference between the first 
and the third passes. If dropout is used, the third pass should use the same dropout matrix as used 
on the first pass. 

Non-linear activation functions can be considered as a separate layer, even if they are usually 
implemented as a part of each layer of the other type. They do not contain weights, so we write just 
$f(z)$. The most common functions are: (i) sigmoid, $f(z) = 1/(1 + e^{-z})$, (ii) rectified linear unit 
(relu), $f(z) = \max(z, 0)$, and (iii) softmax, $f(z_i) = e^{z_i} / \sum_j e^{z_j}$. All of them are differentiable 
(except relu at 0, but it does not cause uncertainty) and have a symmetric Jacobian matrix, so 
according to Theorem 2 the third pass is the same as the backward pass. For example, in the case of the 
relu function this means that $d\tilde{y}_i = d\tilde{y}_{i-1} * I(y_{i-1} > 0)$, where element-wise multiplication is used. 
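
The symmetry claim is easy to confirm numerically. The snippet below is an illustrative sketch with hypothetical helper names: it builds the softmax Jacobian $\mathrm{diag}(p) - p\,p^T$ explicitly and applies the relu rule stated above to a small vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, k] = d p_i / d z_k = p_i * (delta_ik - p_k)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def relu_third_pass(dty_prev, y_prev):
    """Third pass through relu: the same mask as on the backward pass."""
    return dty_prev * (y_prev > 0)

z = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(z)
is_symmetric = np.allclose(J, J.T)

y_prev = np.array([1.5, -0.2, 0.0, 3.0])     # relu inputs from the first pass
dty_prev = np.array([0.1, 0.2, 0.3, 0.4])    # incoming third-pass values
dty = relu_third_pass(dty_prev, y_prev)      # -> [0.1, 0.0, 0.0, 0.4]
```

The sigmoid and relu Jacobians are diagonal because both functions act element-wise, so they are symmetric trivially. Softmax is the only case of the three where the off-diagonal entries matter.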
Convolution layers perform 2D filtering of the activation maps with the matrices of weights. 
Since each element of $y_i$ is a linear combination of elements of $y_{i-1}$, convolution is also a 
linear transformation. Linearity immediately gives that $\tilde{f}_i(d\tilde{y}_{i-1}; w_i) = f_i(d\tilde{y}_{i-1}; w_i)$ and 
$d\tilde{w}_i = d\tilde{y}_{i-1}^T \cdot dy_i$. Therefore the third pass of a convolutional layer repeats its first pass, i.e., it is performed 
by convolving $d\tilde{y}_{i-1}$ with the same filters using the same stride and padding. As with the fully 
connected layers, we do not add biases to the resulting maps and do not compute their derivatives. 

The scaling layer aggregates the values over a region to a single value. Typical aggregation 
functions $f_i(y_{i-1})$ are mean and max. As follows from their definition, both of them also perform 
linear transformations, so $d\tilde{y}_i = f_i(d\tilde{y}_{i-1})$. Notice that in the case of the max function it means that 
on the third pass the same elements of $d\tilde{y}_{i-1}$ should be chosen for propagation to $d\tilde{y}_i$ as on the first 
pass, regardless of what value they have. 


Algorithm 1 Invariant backpropagation: a single batch processing description 

1. Perform standard forward and backward passes, and compute the derivatives $dw$ for the 
main loss function. 

2. Perform an additional forward pass using the derivatives $dy_0$ or their signs $\mathrm{sign}(dy_0)$ as activations. 
On this pass: 

• do not add biases to activations 

• use backward versions of non-linear functions 

• on max-pooling layers propagate the same positions as on the first pass 

3. Compute the derivatives $d\tilde{w}$ for the additional loss function $\tilde{L}$ the same way as $dw$. 
Initialize the bias derivatives in $d\tilde{w}$ to 0. 

4. Update the weights according to Eq. (8). 
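
The steps of Algorithm 1 can be sketched for a small network as follows. This is a hedged illustration rather than the reference implementation: it assumes one hidden relu layer, a softmax output with cross-entropy loss summed over the batch, and the norm order 2 variant of the additional loss. The function name `loss_ibp_gradients`, the values of `beta` and `lr`, and the form of the final update are assumptions, since Eq. (8) is not reproduced in this section.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_ibp_gradients(x, labels, w1, b1, w2, b2):
    # Step 1: standard forward and backward passes.
    y1 = x @ w1 + b1
    a1 = np.maximum(y1, 0.0)
    p = softmax(a1 @ w2 + b2)
    dz = p - labels                       # gradient w.r.t. the softmax input
    dw2, db2 = a1.T @ dz, dz.sum(axis=0)
    mask = (y1 > 0).astype(float)
    dy1 = (dz @ w2.T) * mask              # gradient w.r.t. the hidden pre-activation
    dw1, db1 = x.T @ dy1, dy1.sum(axis=0)
    dy0 = dy1 @ w1.T                      # gradient w.r.t. the input x

    # Step 2: additional forward pass with dy_0 as activations.
    dty0 = dy0
    dty1 = dty0 @ w1                      # no bias is added
    dta1 = dty1 * mask                    # backward version of relu

    # Step 3: additional weight derivatives; bias derivatives stay 0.
    dtw1 = dty0.T @ dy1
    dtw2 = dta1.T @ dz
    main = (dw1, db1, dw2, db2)
    extra = (dtw1, np.zeros_like(b1), dtw2, np.zeros_like(b2))
    return main, extra, (dz, mask)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
labels = np.eye(3)[rng.integers(0, 3, size=5)]       # one-hot labels
w1, b1 = 0.5 * rng.normal(size=(4, 6)), np.zeros(6)
w2, b2 = 0.5 * rng.normal(size=(6, 3)), np.zeros(3)

main, extra, frozen = loss_ibp_gradients(x, labels, w1, b1, w2, b2)

# Step 4 (assumed form): combine both gradients with strength beta.
beta, lr = 0.1, 0.01
params = (w1, b1, w2, b2)
updated = tuple(w - lr * (g + beta * gt) for w, g, gt in zip(params, main, extra))
```

In this sketch the output gradient `dz` and the relu mask are treated as constants when the additional derivatives are formed, which matches the description of the third pass but leaves out any dependence of `dz` on the weights.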


7.3 Regularization properties of Loss IBP 

In the case of the $L_2$-norm loss function (6), we can derive some interesting theoretical properties. Using the 
Cauchy-Schwarz inequality, we obtain 

$$\|\nabla_x L\|_2 \le \|\nabla_x y_K\|_2 \cdot \|\nabla_{y_K} L\|_2 \le \|\nabla_x y_{K-1}\|_2 \cdot \|\nabla_{y_{K-1}} y_K\|_2 \cdot \|\nabla_{y_K} L\|_2$$

The most common loss functions for the predictions $p(x) = y_K$ and true labels $l(x)$ are the squared 
loss $L(p(x)) = \frac{1}{2} \sum_{i=1}^M (p_i(x) - l_i(x))^2$ and the cross-entropy loss $L(p(x)) = -\sum_{i=1}^M l_i(x) \log p_i(x)$, applied 
to the softmax output layer $p_i(x) = \phi(z_i) = e^{z_i} / \sum_{j=1}^M e^{z_j}$. Here $M$ is the number of neurons in 
the output layer (number of classes), and $z = y_{K-1}$. In the first case we have $\nabla_{y_K} L = p(x) - l(x)$, 
in the second case we can show that $\nabla_{y_{K-1}} L = p(x) - l(x)$. Therefore, the strength of the $L_2$-norm 
Loss IBP regularization decreases when the predictions $p(x)$ approach the true labels $l(x)$. This 
property prevents overregularization when the classifier achieves high accuracy. Notice that if a 
network has no hidden layers, then $\nabla_x y_{K-1} = w$, i.e., in this case the $\|\nabla_x L\|_2$ penalty term can be 
considered as a weight decay regularizer, multiplied by $p(x) - l(x)$. 
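
The second identity, for the cross-entropy loss on a softmax layer, can be verified in a few lines. The derivation below is a sketch that assumes the labels sum to one, $\sum_{i=1}^M l_i(x) = 1$ (for example, a one-hot vector), which the text does not state explicitly. Here $\delta_{ik}$ is the Kronecker delta.

$$
\begin{aligned}
\frac{\partial p_i}{\partial z_k} &= p_i\,(\delta_{ik} - p_k), \\
\frac{\partial L}{\partial z_k} &= -\sum_{i=1}^M \frac{l_i}{p_i}\,\frac{\partial p_i}{\partial z_k}
 = -\sum_{i=1}^M l_i\,(\delta_{ik} - p_k)
 = p_k \sum_{i=1}^M l_i - l_k
 = p_k - l_k .
\end{aligned}
$$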

For the model of a single neuron we can derive another interesting property. In Bishop (1995b) it 
was demonstrated that for a single neuron with the norm loss function noise injection is equivalent 
to the weight decay $\|w\|_2$ regularization. In Section 7.4 we show that if the negative log-loss 
function is used, noise injection becomes equivalent to the Loss IBP regularizer. 


7.4 Noise injection 


Assuming Gaussian noise $\eta \sim N(0, \sigma^2 I)$, such that $E[\eta] = 0$ and $E[\eta^T \eta] = \sigma^2 I$, we can 
approximate an arbitrary loss function $L(p(x))$ as 

$$E[L(p(x + \eta))] \approx E\left[ L(p(x)) + \nabla_x L(p(x))\,\eta^T + \frac{1}{2}\,\eta H(x)\,\eta^T \right] = L(p(x)) + \frac{\sigma^2}{2}\,\mathrm{Tr}(H(x)),$$
where $\mathrm{Tr}(H(x))$ is the trace of the Hessian matrix $H$, consisting of the second derivatives of 
$L(p(x))$ with respect to the elements of $x$. Solving the differential equation 

$$\sum_{i=1}^N \frac{\partial^2 L}{\partial x_i^2} = \sum_{i=1}^N \left( \frac{\partial L}{\partial x_i} \right)^2 = \|\nabla_x L\|_2^2$$

for each $x_i$ independently, we can find the following solution: 

$$L = -l(x) \ln \left| \sum_{i=1}^N x_i w_i + b \right| - (1 - l(x)) \ln \left| 1 - \sum_{i=1}^N x_i w_i - b \right|,$$

where $l = l(x) \in \{0, 1\}$ is the class label for the object $x$. Indeed, assuming $p = \sum_{i=1}^N x_i w_i + b$, 
we obtain the first derivatives: 


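$$\left( \frac{\partial L}{\partial x_i} \right)^2 = \left( -\frac{l\,w_i}{p} + \frac{(1 - l)\,w_i}{1 - p} \right)^2 = w_i^2 \left( \frac{p - l}{p(1 - p)} \right)^2 = w_i^2\,\frac{p^2 - 2pl + l^2}{p^2 (1 - p)^2} \quad (17)$$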


Now we can compute the second derivatives: 

$$\frac{\partial^2 L}{\partial x_i^2} = \frac{\partial}{\partial x_i} \left[ w_i\,\frac{p - l}{p(1 - p)} \right] = w_i^2\,\frac{p^2 - 2pl + l}{p^2 (1 - p)^2} \quad (18)$$

Notice that the last expression uses $l$ instead of $l^2$. However, if $l \in \{0, 1\}$, then $l = l^2$, so the 
expressions (17) and (18) are equal. Therefore, when the negative log-likelihood function $L$ is 
applied to a single neuron without a non-linear transfer function, the Gaussian noise added to the 
input vector $x$ is equivalent to the IBP regularization term $\|\nabla_x L\|_2^2$. This result is supported by the 
discussion in Fawzi et al. (2015), where the authors show that for the linear classifier the robustness 
to adversarial examples is bounded from below by the robustness to random noise. However, since 
$\mathrm{Tr}(H(x))$ is only the expected value, the quality of approximation also depends on the number of 
iterations. 
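
The equality of (17) and (18) for binary labels can be confirmed numerically. The sketch below is illustrative only: the weights, bias and input are arbitrary values chosen so that $p$ lies in $(0, 1)$, the function names are hypothetical, and the Hessian trace is estimated with central second differences rather than computed in closed form.

```python
import numpy as np

def nll(x, w, b, label):
    """Negative log-likelihood of a single linear neuron, p = x . w + b."""
    p = float(x @ w + b)
    return -label * np.log(abs(p)) - (1 - label) * np.log(abs(1 - p))

def grad_sq_norm(x, w, b, label):
    """||grad_x L||^2, i.e. the sum of (17) over all input coordinates."""
    p = float(x @ w + b)
    dL_dp = (p - label) / (p * (1 - p))
    return float(np.sum((w * dL_dp) ** 2))

def hessian_trace(x, w, b, label, eps=1e-4):
    """Tr(H) estimated by central second differences along each coordinate."""
    base = nll(x, w, b, label)
    trace = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        trace += (nll(x + e, w, b, label) - 2 * base + nll(x - e, w, b, label)) / eps ** 2
    return trace

x = np.array([0.2, -0.1, 0.4])
w = np.array([0.5, 0.3, 0.25])
b = 0.3                                   # gives p = 0.47

results = {
    label: (grad_sq_norm(x, w, b, label), hessian_trace(x, w, b, label))
    for label in (0, 1)
}
```

For both labels the two numbers in `results[label]` agree up to the finite-difference error, which is the statement that $\mathrm{Tr}(H(x)) = \|\nabla_x L\|_2^2$ for this loss.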


7.5 Equivalence of Loss IBP and Fast TBP 


In the main text we showed that the gradient $\nabla_\theta L(p(g(x, \theta)))|_{\theta=0}$ can be efficiently computed by 
multiplying the gradient $dy_0 = \nabla_x L(p(x))$, obtained at the end of the backward pass, by the tangent 
vector $\nabla_\theta x$. We can demonstrate that Loss IBP with the additional loss function $\tilde{L}(dy_0) = dy_0 \cdot \nabla_\theta x$ 
is equivalent to Fast Tangent BP with its additional loss function. 


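In Fast Tangent BP we perform an additional iteration of backpropagation through the linearized 
network, applied to a tangent vector $\nabla_\theta x$. The additional forward pass computes the following 
values: 

$$\tilde{y}_i = \nabla_\theta x^T \cdot \prod_{j=1}^{i} J_j, \qquad R(x) = \tilde{y}_{K+1}$$

On the additional backward pass the computed gradients $d\tilde{y}_i = \partial R(x) / \partial \tilde{y}_i$ are therefore 

$$d\tilde{y}_i = \frac{\partial R(x)}{\partial \tilde{y}_K} \cdot \prod_{j=K}^{i+1} J_j^T = \prod_{j=K+1}^{i+1} J_j^T$$

According to (15), the weight gradients are then 

$$d\tilde{w}_i = \tilde{y}_{i-1}^T \cdot d\tilde{y}_i = \left( \nabla_\theta x^T \cdot \prod_{j=1}^{i-1} J_j \right)^T \cdot \prod_{j=K+1}^{i+1} J_j^T = \prod_{j=i-1}^{1} J_j^T \cdot \nabla_\theta x \cdot \prod_{j=K+1}^{i+1} J_j^T$$

We thus see that in order to compute the additional weight derivatives $d\tilde{w}_i$, we need to compute the 
cumulative Jacobian products from both sides of the network. 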


Let us now compute the same gradients $d\tilde{w}_i$ for Loss IBP with $\tilde{L}(dy_0) = dy_0 \cdot \nabla_\theta x$. In this case we 
initialize the third pass by the tangent vector $\nabla_\theta x^T = \partial \tilde{L}(dy_0) / \partial (dy_0)$. Thus the third pass values 
are 

$$d\tilde{y}_i = \nabla_\theta x^T \cdot \prod_{j=1}^{i} J_j$$

According to (16), the gradients are 

$$d\tilde{w}_i = d\tilde{y}_{i-1}^T \cdot dy_i = \left( \nabla_\theta x^T \cdot \prod_{j=1}^{i-1} J_j \right)^T \cdot \prod_{j=K+1}^{i+1} J_j^T = \prod_{j=i-1}^{1} J_j^T \cdot \nabla_\theta x \cdot \prod_{j=K+1}^{i+1} J_j^T$$

Therefore, the weight gradients of both algorithms are the same, so the algorithms are equivalent. 